Redcedar Data Analyses Instructions
Please note this analysis and R Markdown document are still in development :)
The overall approach is to model empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.
The steps for wrangling the iNat data are described here.
Data were subset to include only the GPS information needed for collecting ancillary data.
Climate data were then downloaded for the iNat GPS locations with the ClimateNA tool, following the process below.
ClimateNA version 7.42
Variables
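ClimateNA's point tool accepts a multi-location CSV with an ID1, ID2, lat, long, el header; a minimal sketch of building one from the iNat observations (the iNat column names and the toy coordinate/elevation values are assumptions):

```r
# Sketch: build a multi-location input file for the ClimateNA point tool.
# The iNat column names and values here are assumptions; the ID1, ID2, lat,
# long, el header follows ClimateNA's documented CSV input format.
inat <- data.frame(id        = 1:3,
                   latitude  = c(47.61, 48.43, 45.52),
                   longitude = c(-122.33, -123.36, -122.68),
                   elevation = c(56, 74, 15))
climatena.input <- data.frame(ID1  = inat$id,
                              ID2  = "redcedar",
                              lat  = inat$latitude,
                              long = inat$longitude,
                              el   = inat$elevation)
write.csv(climatena.input, "inat_climatena_input.csv", row.names = FALSE)
```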
Note that the analysis below uses the iNat data set with 1510 observations. Amazing!
Remove specific climate-file columns that are not useful as explanatory variables (e.g., norm_Latitude).
For an unknown reason, one observation has an extremely negative CMI value (around -10,000).
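One way to drop that outlier, sketched with toy values (the column name `norm_CMI` and the -1000 cutoff are assumptions):

```r
# Sketch: drop the single observation with an implausibly negative CMI.
# The column name norm_CMI and the -1000 cutoff are assumptions; the values
# are toy stand-ins for the real normals table.
normals <- data.frame(norm_CMI = c(35.2, 41.8, -10237.5, 28.9))
normals <- normals[normals$norm_CMI > -1000, , drop = FALSE]
nrow(normals)  # 3: the anomalous row is gone
```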
Normals data for 265 variables were downloaded for each point:
- Monthly - 180 variables represented data averaged over months for the 30-year period
- Seasonal - 60 variables represented data averaged over 3-month seasons (4 seasons) for the 30-year period
- Annual - 20 variables represented data averaged over all years during the 30-year period
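The monthly/seasonal/annual split can be recovered from ClimateNA's naming convention (monthly variables end in 01-12, seasonal ones in _wt/_sp/_sm/_at); a sketch, assuming the normals columns follow that convention:

```r
# Sketch: partition ClimateNA variable names into monthly, seasonal, and
# annual groups by suffix. The example names are assumptions following
# ClimateNA naming conventions.
vars <- c("norm_Tmax01", "norm_PPT12", "norm_CMI_wt", "norm_PPT_sm",
          "norm_MAP", "norm_DD_18")
monthly  <- vars[grepl("(0[1-9]|1[0-2])$", vars)]
seasonal <- vars[grepl("_(wt|sp|sm|at)$", vars)]
annual   <- setdiff(vars, c(monthly, seasonal))
```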
Remove variables that have near-zero standard deviations (i.e., the entire column is the same value).
Full

There were `length(normals) - length(normals.nearzerovar)` variables with near-zero standard deviation. Dropping those columns removed `length(normals) - length(normals.nearzerovar)` climate variables.
Monthly

There were `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly variables with near-zero standard deviation. Dropping those columns removed `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly climate variables.
Seasonal

There were `length(normals.seasonal) - length(normals.seasonal.nearzerovar)` seasonal variables with near-zero standard deviation.
Annual

There were `length(normals.annual) - length(normals.annual.nearzerovar)` annual variables with near-zero standard deviation.
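A sketch of the near-zero-standard-deviation filter used above, with toy data standing in for the real normals table (object names mirror the text):

```r
# Sketch: drop any column whose values are all identical (standard deviation
# of zero). Toy data; the real code operates on the downloaded normals table.
normals <- data.frame(norm_MAP  = c(900, 1200, 1500),
                      norm_flag = c(1, 1, 1),      # constant column
                      norm_CMI  = c(30, 45, 12))
keep <- sapply(normals, function(x) sd(x, na.rm = TRUE) > 0)
normals.nearzerovar <- normals[, keep, drop = FALSE]
length(normals) - length(normals.nearzerovar)  # 1 column dropped
```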
Remove the other response variable from each data set, leaving either the binary or the five-category canopy-symptom classification.
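One way to do this, sketched with toy rows (the two response names mirror the randomForest calls below; the predictor columns and values are stand-ins):

```r
# Sketch: keep exactly one response column per modelling data frame.
# The response names mirror the randomForest calls; predictors are toy data.
obs <- data.frame(
  reclassified.tree.canopy.symptoms = factor(c("Healthy", "Dead Top", "Other")),
  binary.tree.canopy.symptoms       = factor(c("Healthy", "Unhealthy", "Unhealthy")),
  norm_CMI = c(30, -5, 12),
  norm_MAP = c(900, 600, 1100))
five.cats.full <- obs[, names(obs) != "binary.tree.canopy.symptoms"]
binary.full    <- obs[, names(obs) != "reclassified.tree.canopy.symptoms"]
```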
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 42.19%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 84 133 28 32 21 0.7181208
## Healthy 66 1201 95 78 32 0.1841033
## Other 25 170 71 38 12 0.7753165
## Thinning Canopy 33 175 36 88 7 0.7404130
## Tree is Dead 25 50 9 12 32 0.7500000
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.monthly, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 42.15%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 83 136 28 31 20 0.7214765
## Healthy 64 1209 90 78 31 0.1786685
## Other 25 177 69 33 12 0.7816456
## Thinning Canopy 33 181 33 84 8 0.7522124
## Tree is Dead 24 52 9 11 32 0.7500000
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.seasonal, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 43.01%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 82 139 26 32 19 0.7248322
## Healthy 69 1195 96 82 30 0.1881793
## Other 27 174 66 38 11 0.7911392
## Thinning Canopy 36 177 37 80 9 0.7640118
## Tree is Dead 25 53 7 11 32 0.7500000
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.annual, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 42.66%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 85 135 26 32 20 0.7147651
## Healthy 68 1193 94 82 35 0.1895380
## Other 26 178 66 38 8 0.7911392
## Thinning Canopy 32 180 32 87 8 0.7433628
## Tree is Dead 24 51 9 11 33 0.7421875
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 31.49%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1107 365 0.2479620
## Unhealthy 439 642 0.4061055
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 31.84%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1099 373 0.2533967
## Unhealthy 440 641 0.4070305
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 32.47%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1094 378 0.2567935
## Unhealthy 451 630 0.4172063
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 32.47%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1091 381 0.2588315
## Unhealthy 448 633 0.4144311
Summary of model performance
| Response | Explanatory | Variables tried at each split | OOB error (%) |
|----------|-------------|-------------------------------|---------------|
| 5 class  | Full        | 14                            | 42.19         |
| 5 class  | Monthly     | 12                            | 42.15         |
| 5 class  | Seasonal    | 7                             | 43.01         |
| 5 class  | Annual      | 4                             | 42.66         |
| Binary   | Full        | 14                            | 31.49         |
| Binary   | Monthly     | 12                            | 31.84         |
| Binary   | Seasonal    | 7                             | 32.47         |
| Binary   | Annual      | 4                             | 32.47         |
2001 trees is overkill because the error rate stabilizes after about 800 trees.
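This can be checked by plotting the OOB error against the number of trees; a sketch on a stand-in model (for the real analysis, plot the fitted binary/five-class forests instead):

```r
# Sketch: inspect OOB error as a function of the number of trees.
# Toy data stand in for the real binary data set.
library(randomForest)
set.seed(42)
toy <- data.frame(y  = factor(sample(c("Healthy", "Unhealthy"), 300, TRUE)),
                  x1 = rnorm(300), x2 = rnorm(300))
rf <- randomForest(y ~ ., data = toy, ntree = 2001)
plot(rf)  # error curves flatten long before 2001 trees
rf$err.rate[c(100, 800, 2001), "OOB"]
```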
Clearly, all of the climate variables are highly correlated.
Let's pick the top-performing variable from the random forest analyses, CMI, and then add any variables that are less correlated with it.
Below we can check the correlation of CMI, MAP, and DD_18
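The check itself is one call to `cor()`; toy values stand in for the real normals columns (the column names are assumptions):

```r
# Sketch: pairwise correlations among the three retained climate variables.
# Toy values stand in for the real normals columns.
clim <- data.frame(norm_CMI   = c(30, -5, 12, 44, 8),
                   norm_MAP   = c(900, 400, 700, 1300, 600),
                   norm_DD_18 = c(3500, 4800, 4100, 3000, 4500))
round(cor(clim), 2)
```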
Now we can check how the model performs with only these three climate variables
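A sketch of that refit (here `binary.annual` is a toy stand-in; the response and predictor column names are assumptions mirroring the text):

```r
# Sketch: refit the binary forest on only the three retained variables.
# binary.annual here is toy data; column names mirror the text.
library(randomForest)
set.seed(1)
n <- 200
binary.annual <- data.frame(
  binary.tree.canopy.symptoms = factor(sample(c("Healthy", "Unhealthy"), n, TRUE)),
  norm_CMI = rnorm(n), norm_MAP = rnorm(n), norm_DD_18 = rnorm(n))
rf.three <- randomForest(binary.tree.canopy.symptoms ~ norm_CMI + norm_MAP + norm_DD_18,
                         data = binary.annual, ntree = 501, importance = TRUE)
print(rf.three)
```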
It’s hard to give up the seasonal data, but the seasonal variables are all highly correlated (data not shown), and in the importance plot for the seasonal model the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) had the highest MeanDecreaseAccuracy and MeanDecreaseGini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter value of each variable.
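Those importance measures can be inspected directly; a sketch on a stand-in model (for the real analysis, call these on the fitted seasonal forest):

```r
# Sketch: extract and plot randomForest importance measures.
# iris is a stand-in data set for illustration only.
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 501, importance = TRUE)
importance(rf)  # MeanDecreaseAccuracy and MeanDecreaseGini per predictor
varImpPlot(rf)
```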